Project - Unsupervised Learning

by HARI SAMYNAATH S

User Defined functions / classes

Part ONE

DOMAIN: Automobile
CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
DATA DESCRIPTION:

Variable       Type
cylinders      multi-valued discrete
acceleration   continuous
displacement   continuous
model year     multi-valued discrete
horsepower     continuous
origin         multi-valued discrete
weight         continuous
car name       string (unique for each instance)
mpg            continuous

PROJECT OBJECTIVE: To understand K-means Clustering by applying it to the Car Dataset to segment the cars into various categories.

STEPS AND TASK:
1. Data Understanding & Exploration:
A. Read ‘Car name.csv’ as a DataFrame and assign it to a variable.
B. Read ‘Car-Attributes.json’ as a DataFrame and assign it to a variable.
C. Merge both the DataFrames together to form a single DataFrame.
D. Print 5 point summary of the numerical features and share insights.
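Steps 1.A–1.D can be sketched as below. The real inputs are ‘Car name.csv’ and ‘Car-Attributes.json’; tiny inline frames stand in for them here so the example is self-contained.

```python
import pandas as pd

# stand-ins for pd.read_csv('Car name.csv') and pd.read_json('Car-Attributes.json')
car_names = pd.DataFrame({'car_name': ['chevrolet chevelle', 'buick skylark',
                                       'plymouth satellite']})
car_attrs = pd.DataFrame({'mpg': [18.0, 15.0, 18.0],
                          'cyl': [8, 8, 8],
                          'wt': [3504, 3693, 3436]})

# the two sources share no key column, so relate them by index
df = pd.merge(car_names, car_attrs, left_index=True, right_index=True)

# five-point summary (min, 25%, 50%, 75%, max) of the numeric features
summary = df.describe()
print(summary)
```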

The two datasets can be related by index.

The dataset contains both numeric and object dtypes.

Fuel efficiency ranges from 9 to 46.6 mpg; unsurprisingly, displacement, weight and acceleration also vary widely.
The scales also differ considerably, so standardisation will be necessary.

Note that the hp column is missing from the 5-point summary above, probably due to unexpected type casting or string values in that feature.
Let us review the 5-point summary once again after data cleaning, for better understanding.

2. Data Preparation & Analysis:
A. Check and print feature-wise percentage of missing values present in the data and impute with the best suitable approach.
B. Check for duplicate values in the data and impute with the best suitable approach.
H. Check for unexpected values in all the features and datapoints with such values.

There are no NULLs, NaNs or blank fields in the dataset.
But there are 6 unexpected values in the "hp" column.
Since the percentage of unexpected values is just 1.51%, those rows could simply be dropped. But, for the sake of learning, let's review the hp column and impute the unexpected values accordingly.

There are "?" symbols marking unknown/missing values.
Let's replace those with NaN.
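A minimal sketch of the "?" cleanup, using a tiny stand-in frame in place of the merged car DataFrame; the median is used for imputation since it is robust to skew.

```python
import numpy as np
import pandas as pd

# stand-in hp column containing the '?' placeholders found in the data
df = pd.DataFrame({'hp': ['130', '165', '?', '150', '?', '140']})

df['hp'] = df['hp'].replace('?', np.nan)       # mark unknowns as NaN
df['hp'] = pd.to_numeric(df['hp'])             # cast the column to float
df['hp'] = df['hp'].fillna(df['hp'].median())  # impute with the median

print(df['hp'].tolist())  # → [130.0, 165.0, 145.0, 150.0, 145.0, 140.0]
```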

The distribution has not shifted much after imputation.
All unexpected values have been imputed successfully.
Let's check for duplicates.

No duplicate records

No duplicate values in the attributes also

Yes, there are repeated car names.
Let's try appending the model year to the car name.
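A sketch of the disambiguation step, with hypothetical column names (`car_name`, `yr`) standing in for the actual ones:

```python
import pandas as pd

# two cars sharing a name but differing in model year
df = pd.DataFrame({'car_name': ['ford pinto', 'ford pinto'], 'yr': [71, 72]})

# build a custom name by appending the model year
df['car_name_2'] = df['car_name'] + ' ' + df['yr'].astype(str)

print(df['car_name_2'].tolist())  # → ['ford pinto 71', 'ford pinto 72']
```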

Custom name created.
Now let's recheck for duplicates in car_name_2.

Duplicates still remain.
Let's try adding information about low kerb weight.

Successfully named.
If all additional columns are dropped, there will be no duplicates.

Let's check for duplicate car_name.

Hence all duplicates have been resolved successfully.

2. Data Preparation & Analysis:
C. Plot a pairplot for all features.

The following inferences could be made from the above distribution:

  1. weight directly impacts mileage
  2. increased weight calls for more engine displacement
  3. the greater the engine displacement, the higher the horsepower
  4. more engine displacement results in lower mileage, though at the car level other factors such as weight and acceleration also contribute
  5. the industry has shown improvement in mileage over the years
  6. the progress over the years is towards lower displacement and lighter weight, providing slightly better acceleration without compromising mileage
  7. Japanese cars (origin 3) showcase the lowest weights, while American cars (origin 1) boast more muscle

The following inferences could be made from the above distribution:

  1. lower-mileage cars have dominated the industry
  2. the industry favours fewer cylinders, smaller displacements, lower horsepower and lower weights
  3. the industry manages to achieve a nominal acceleration of around 15 units
  4. the dataset fairly represents cars over 1970 to 1980
  5. the dataset contains more American cars (origin 1)

2. Data Preparation & Analysis:
D. Visualize a scatterplot for ‘wt’ and ‘disp’. Datapoints should be distinguishable by ‘cyl’.
E. Share insights for Q2.d.
F. Visualize a scatterplot for ‘wt’ and ’mpg’. Datapoints should be distinguishable by ‘cyl’.
G. Share insights for Q2.f.
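Both scatterplots (Q2.D and Q2.F) can be sketched as below, colouring the points by `cyl`; a tiny stand-in frame replaces the cleaned car DataFrame.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

# stand-in for the cleaned car DataFrame
df = pd.DataFrame({'wt':   [2130, 3504, 4341, 2264],
                   'disp': [97, 307, 350, 98],
                   'mpg':  [27.0, 18.0, 14.0, 26.0],
                   'cyl':  [4, 8, 8, 4]})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sc1 = ax1.scatter(df['wt'], df['disp'], c=df['cyl'], cmap='viridis')  # Q2.D
ax1.set(xlabel='wt', ylabel='disp')
ax2.scatter(df['wt'], df['mpg'], c=df['cyl'], cmap='viridis')         # Q2.F
ax2.set(xlabel='wt', ylabel='mpg')
fig.colorbar(sc1, ax=ax1, label='cyl')
fig.savefig('scatter_wt.png')
```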

Inferences:
As the weight of the car increases, engine displacement needs to increase to handle the higher loads.
Higher engine displacement is achieved using more cylinders.
Only American cars (origin 1) have 8 cylinders.
Only Japanese cars (origin 3) have 3 cylinders; only German cars (origin 2) have 5 cylinders.

Inferences:
Increased weight drastically impacts mileage: the more the weight, the less the mileage.
More cylinders could mean lower mileage.
Yet 4-cylinder cars have exhibited better mileage than 3-cylinder cars of comparable weight.

3. Clustering:
A. Apply K-Means clustering for 2 to 10 clusters.
B. Plot a visual and find elbow point.
C. On the above visual, highlight which are the possible Elbow points.
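The elbow search above can be sketched as follows; synthetic blob data stands in for the scaled car features, and inertia (within-cluster sum of squared distances) is plotted against k.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic standardised data standing in for the scaled car features
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

ks = range(2, 11)  # K-means for 2 to 10 clusters
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

# elbow plot: look for the k where the curve flattens out
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia (within-cluster SSE)')
plt.savefig('elbow.png')
```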

From the above graph it is evident that 3 clusters is the best choice.
To enhance granularity, let's choose the second-best option: 5 clusters.

3. Clustering:
D. Train a K-means clustering model once again on the optimal number of clusters.
E. Add a new feature in the DataFrame which will have labels based upon cluster value.
F. Plot a visual and color the datapoints based upon clusters.
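Steps 3.D–3.F can be sketched as below; synthetic data stands in for the scaled car features, with k=5 per the elbow choice and hypothetical feature names `f1`, `f2`.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the scaled car features
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
df = pd.DataFrame(X, columns=['f1', 'f2'])

# step D: retrain on the chosen number of clusters
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
# step E: add the cluster labels as a new feature
df['cluster'] = km.labels_
# step F: colour the datapoints by cluster
plt.scatter(df['f1'], df['f2'], c=df['cluster'], cmap='viridis')
plt.savefig('clusters.png')
```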

The above plot gives a picture of the clustering results.
Yet, to study the cluster reasoning, let's make custom charts as below.

Note: we refer to clusters by colour; the cluster numbers could change if the model is retrained, but the clusters themselves remain the same unless the data changes, and the colours remain the same unless the colourmap is changed.

It could be seen from the above visualisations that the PURPLE CLUSTER represents the most efficient cars, built to be lightweight and products of technological advancements (like low-displacement engines) as the years passed by.
Interestingly, the RED CLUSTER represents the industry's age-old attempt to improve mileage by merely keeping weights low, which has not achieved results as good as the latest powertrain improvements seen in the PURPLE CLUSTER.
In contrast to the PURPLE CLUSTER, the BLUE CLUSTER represents the older heavyweight cars whose large-displacement engines gave poor mileage.
The GREEN CLUSTER represents cars across the entire history, with close-to-median weights and below-median efficiencies.
Though the ORANGE CLUSTER cars are relatively new-age lightweight cars, their mileage is not up to the range of the PURPLE CLUSTER; this could be the result of missing out on technologies for the sake of reduced capital investment (car costs/technology costs).

It could also be noted that German & Japanese cars have remained low-weight and made significant transitions towards improved efficiencies, while American cars have been slow to catch up.

3. Clustering:
G. Pass a new DataPoint and predict which cluster it belongs to.
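The prediction step can be sketched as below; synthetic data with hypothetical weight/displacement scales stands in for the car features. The key point is that a new datapoint must pass through the same scaler fitted on the training data before calling `predict`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for two raw car features (e.g. wt, disp)
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[3000, 150], scale=[600, 40], size=(200, 2))

scaler = StandardScaler().fit(X_raw)
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(scaler.transform(X_raw))

# a hypothetical new datapoint, scaled with the SAME fitted scaler
new_point = np.array([[3500, 180]])
label = km.predict(scaler.transform(new_point))[0]
print('assigned to cluster', label)
```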

The above datapoints have been aptly placed in the BLUE CLUSTER, for they are old, poor-efficiency cars, primarily of American origin (1).

Our model successfully predicted our intended datapoint into the PURPLE CLUSTER, as expected.

Part TWO

DOMAIN: Automobile
CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
DATA DESCRIPTION: The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
• All the features are numeric i.e. geometric features extracted from the silhouette.
PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model and compare relative results.

1. Data Understanding & Cleaning:
A. Read ‘vehicle.csv’ and save as DataFrame.
B. Check percentage of missing values and impute with correct approach.

There are less than 1% missing values in the columns listed above.
Let's impute them appropriately.

Missing values are successfully imputed without disturbing the data distribution

1. Data Understanding & Cleaning:
C. Visualize a Pie-chart and print percentage of values for variable ‘class’.

Based on the problem description and the above plot, one may assume that the dataset contains approximately equal numbers of datapoints for the bus, the van and the two cars.

1. Data Understanding & Cleaning:
D. Check for duplicate rows in the data and impute with correct approach.

No duplicate rows found; let's check for duplicates after dropping the class column.

Still no duplicates found, hence the dataset is a collection of unique datapoints.

2. Data Preparation:
A. Split data into X and Y. [Train and Test optional]
B. Standardize the Data.

All attributes are numeric, so let's proceed with standardisation.

The scales of the features vary: lower limits range from 0 to 200, while upper limits range from 30 to 1000.
Hence scaling is necessary.
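The standardisation step can be sketched as below; synthetic columns on deliberately different scales stand in for the silhouette attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on very different scales, like the silhouette attributes
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 30, 100),
                     rng.uniform(200, 1000, 100)])

# z-score scaling: (x - mean) / std, per feature
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```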

The features are rescaled to between -2 and 3, except for outliers in the
max.length_aspect_ratio,
pr.axis_aspect_ratio and
scaled_radius_of_gyration.1 features. Those columns may need further processing depending on model needs.

3. Model Building:
A. Train a base Classification model using SVM.
B. Print Classification metrics for train data.
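The baseline can be sketched as below; synthetic 3-class data stands in for the standardised vehicle features, and an SVM with default settings (RBF kernel) is fit and scored on the training data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the standardised vehicle features
X, y = make_classification(n_samples=400, n_features=18, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train = StandardScaler().fit_transform(X_train)

# base SVM classifier (default RBF kernel, C=1.0)
svm = SVC().fit(X_train, y_train)
train_acc = accuracy_score(y_train, svm.predict(X_train))
print(classification_report(y_train, svm.predict(X_train)))
```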

3. Model Building:
C. Apply PCA on the data with 10 components.
D. Visualize Cumulative Variance Explained with Number of Components.
E. Draw a horizontal line on the above plot to highlight the threshold of 90%.
F. Apply PCA on the data. This time Select Minimum Components with 90% or above variance explained.
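Steps 3.C–3.F can be sketched as below; correlated synthetic features stand in for the standardised vehicle data, and the smallest number of components reaching the 90% threshold is picked off the cumulative curve.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# correlated synthetic features standing in for the standardised vehicle data
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 4))  # low-rank structure plus noise
X = StandardScaler().fit_transform(
    base @ rng.normal(size=(4, 18)) + 0.3 * rng.normal(size=(300, 18)))

pca = PCA(n_components=10).fit(X)                  # step C
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumvar >= 0.90) + 1)        # step F: min components >= 90%

plt.plot(range(1, 11), cumvar, marker='o')         # step D
plt.axhline(0.90, linestyle='--')                  # step E: 90% threshold line
plt.xlabel('number of components')
plt.ylabel('cumulative variance explained')
plt.savefig('pca_cumvar.png')
print('components needed for 90%:', n_keep)
```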

The above visualisation shows that 5 components are capable of explaining 90% of the variance in the data.
Hence let's choose 5 components for further modelling.

3. Model Building:
G. Train SVM model on components selected from above step.
H. Print Classification metrics for train data of above model and share insights.

For the given dataset, the SVM on the full feature set performed better in terms of accuracy and almost every other score compared to the results after PCA.
This is a direct consequence of the dimensionality reduction performed by PCA: discarding components discards some information.

Also, PCA has not reduced computational time here.
This is probably because the dataset is neither huge nor very wide, so it does not benefit much from PCA.

4. Performance Improvement:
A. Train another SVM on the components out of PCA. Tune the parameters to improve performance.
B. Share best Parameters observed from above step.
C. Print Classification metrics for train data of above model and share relative improvement in performance in all the models along with insights.

Log of several runs (note: gamma has no effect for the linear kernel):
{'kernel': 'rbf', 'gamma': 1.4000000000000004, 'C': 3.31}
{'kernel': 'rbf', 'gamma': 1.4000000000000004, 'C': 2.81}
{'kernel': 'linear', 'gamma': 8.000000000000007, 'C': 1.36}
{'kernel': 'linear', 'gamma': 7.600000000000006, 'C': 1.9100000000000001}
{'kernel': 'rbf', 'gamma': 1.7000000000000006, 'C': 2.96}
{'kernel': 'rbf', 'gamma': 3.9000000000000026, 'C': 2.46}
{'kernel': 'rbf', 'gamma': 1.8000000000000007, 'C': 4.71}
{'kernel': 'rbf', 'gamma': 1.1, 'C': 0.6100000000000001}
{'kernel': 'rbf', 'gamma': 1.4000000000000004, 'C': 1.4600000000000002}
{'kernel': 'rbf', 'gamma': 1.4000000000000004, 'C': 2.31}
{'kernel': 'rbf', 'gamma': 2.8000000000000016, 'C': 1.76}


Based on the above log, let's run a narrowed-down grid search.
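A narrowed-down grid search around the logged winners can be sketched as below; synthetic data stands in for the PCA-reduced vehicle features, and the candidate gamma/C values are illustrative picks near the logged best runs.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the vehicle data, reduced to 5 PCA components
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=42)
X5 = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

# candidate values chosen near the logged best runs (illustrative)
param_grid = {'kernel': ['rbf'],
              'gamma': [1.1, 1.4, 1.7],
              'C': [1.5, 2.3, 3.0]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X5, y)
print(search.best_params_, round(search.best_score_, 3))
```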

It could be seen from the scores log above that hyperparameter tuning has secured 92% accuracy on the training data.
Performance on the test data is rather low at 79%, but has improved from the 74% of the untuned model.
Other classifier models might be attempted to obtain more generalised performance across training & test data.

To comment on PCA: yes, PCA with model hyperparameter tuning can help obtain reasonably good accuracy (or other scores),
but a justified decision based on the available computational capacity will help.

5. Data Understanding & Cleaning:
A. Explain pre-requisite/assumptions of PCA.
B. Explain advantages and limitations of PCA.

PCA is performed on the independent variables under the following pre-requisites / assumptions:
• explainability of the model built is not of much importance
• there are only linear relations among the attributes
• the features are pre-scaled so that large-scale features are not captured as a priority
• a minuscule loss of accuracy is acceptable in exchange for the computational benefits
• skewness in the features is properly treated to avoid impacting PCA

Accordingly, in our modelling:
we have not focussed on explaining the model fit;
relationships among the attributes need to be studied in detail;
features were scaled to z-scores using StandardScaler;
3 features had skewness, and its impact on the PCA needs to be studied.

As can be seen, several feature pairs exhibit linear relations, certain pairs show no relation, while a few show curved relations that are non-linear. This could be one major cause of the loss of accuracy.

Dimensionality reduction (PCA as a feature-extraction method) provides the following advantages:
• due to the orthogonality of the principal components, multicollinearity in the source data is eliminated
• with fewer attributes to learn from, the computational cost of learning the dataset reduces (in terms of processing time)
• also, since attributes are reduced, less memory suffices, further reducing computational costs
• provides an abstract summary in far fewer dimensions (even though unexplainable), enhancing visualisations
• the reduction in the number of attributes inherently improves the sample-to-feature ratio of the dataset, thus mitigating the curse of dimensionality
• since principal components capture more of the information and less of the noise, over-fitting is avoided
• while PCA brings down computational costs, it manages to retain the largest variances in the data, thus not causing much information loss

The limitations of PCA could be summarised as follows:
• since PCA tries to capture the largest variances in the data, unscaled data could lead to erroneous outcomes
• changing the scaling methodology could affect PCA greatly
• since PCA combines information from various attributes, the components do not possess a meaningful relation to the real world; the learned model and its coefficients thus lose explainability
• skewness in the data (with thick tails) can affect PCA's performance
• non-linear relations in the data are not captured by PCA, which could cause loss of information